Statistical Analysis of the Indus Script Using n-Grams

Authors

Nisha Yadav

Hrishikesh Joglekar

Rajesh P. N. Rao

Mayank N. Vahia

Iravatham Mahadevan

Ronojoy Adhikari

Abstract

The Indus script is one of the major undeciphered scripts of the ancient world. The small size of the corpus, the absence of bilingual texts, and the lack of definite knowledge of the underlying language have frustrated efforts at decipherment since the discovery of the remains of the Indus civilization. Building on previous statistical approaches, we apply the tools of statistical language processing, specifically n-gram Markov chains, to analyze the syntax of the Indus script. We find that unigrams follow a Zipf-Mandelbrot distribution. Text beginner and ender distributions are unequal, providing internal evidence for syntax. We see clear evidence of strong bigram correlations and extract significant pairs and triplets using a log-likelihood measure of association. Highly frequent pairs and triplets are not always highly significant. The model performance is evaluated using information-theoretic measures and cross-validation. The model can restore doubtfully read texts with an accuracy of about 75%. We find that a quadrigram Markov chain saturates information-theoretic measures against a held-out corpus. Our work forms the basis for the development of a stochastic grammar which may be used to explore the syntax of the Indus script in greater detail.
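As a rough illustration of the techniques named in the abstract, the following Python sketch builds a bigram Markov chain over sign sequences, scores a sign pair with a log-likelihood ratio, computes perplexity, and fills in one unreadable sign from its neighbours. It is a minimal sketch under stated assumptions, not the authors' implementation: the toy corpus and its integer sign IDs are invented, add-one smoothing is swapped in for simplicity (the abstract does not say which smoothing the paper uses), and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each text is a sequence of made-up sign IDs.
# These are NOT real Indus texts; they only illustrate the bookkeeping.
TEXTS = [[3, 1, 2], [3, 1, 4], [5, 1, 2], [3, 6, 2], [5, 1, 4]]
START, END = "<s>", "</s>"  # text-beginner and text-ender markers


def count_ngrams(texts):
    """Count unigrams and bigrams, padding each text with START and END."""
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        padded = [START] + list(text) + [END]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def bigram_prob(prev, cur, unigrams, bigrams, vocab):
    """P(cur | prev) with add-one smoothing (an assumed, simple choice)."""
    return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab))


def log_likelihood_ratio(a, b, bigrams):
    """G^2 association score for the ordered pair (a, b) from a 2x2 table."""
    n = sum(bigrams.values())
    row1 = sum(c for (x, _), c in bigrams.items() if x == a)
    col1 = sum(c for (_, y), c in bigrams.items() if y == b)
    k11 = bigrams[(a, b)]
    k12, k21 = row1 - k11, col1 - k11
    k22 = n - k11 - k12 - k21
    cells = [(k11, row1, col1), (k12, row1, n - col1),
             (k21, n - row1, col1), (k22, n - row1, n - col1)]
    # Empty cells contribute zero, so they are skipped.
    return 2 * sum(k * math.log(k * n / (r * c)) for k, r, c in cells if k > 0)


def perplexity(texts, unigrams, bigrams, vocab):
    """Per-transition perplexity of texts under the smoothed bigram model.

    Assumes every sign in `texts` is already in `vocab`.
    """
    log_sum, n = 0.0, 0
    for text in texts:
        padded = [START] + list(text) + [END]
        for prev, cur in zip(padded, padded[1:]):
            log_sum += math.log2(bigram_prob(prev, cur, unigrams, bigrams, vocab))
            n += 1
    return 2 ** (-log_sum / n)


def restore_sign(prev, nxt, signs, unigrams, bigrams, vocab):
    """Guess one unreadable sign as the s maximising P(s|prev) * P(nxt|s)."""
    return max(signs, key=lambda s: bigram_prob(prev, s, unigrams, bigrams, vocab)
               * bigram_prob(s, nxt, unigrams, bigrams, vocab))


unigrams, bigrams = count_ngrams(TEXTS)
signs = {s for text in TEXTS for s in text}
vocab = signs | {END}  # every token that can follow a context

if __name__ == "__main__":
    print("G^2 for pair (3, 1):", log_likelihood_ratio(3, 1, bigrams))
    print("Training perplexity:", perplexity(TEXTS, unigrams, bigrams, vocab))
    print("Restored sign between 3 and 2:",
          restore_sign(3, 2, signs, unigrams, bigrams, vocab))
```

In a cross-validation setting, `count_ngrams` would be run on the training folds only and `perplexity` evaluated on the held-out fold. Extending the context from one preceding sign to two or three gives the trigram and quadrigram chains the abstract refers to.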

Similar resources

Clustering Indus Texts using K-means

One of the most important undeciphered scripts of the ancient world is the Indus script. Earlier studies focused on the correlations between signs in the Indus texts using various statistical and computational techniques such as n-grams or Markov chains. In the present study, K-means clustering, an unsupervised machine learning technique, is used to identify clusters of similar texts without...

A Markov Model of the 4500-year-old Indus Script

Although no historical information exists about the Indus civilization (fl. c. 2600-1900 BC), archaeologists have uncovered about 3800 short samples of a script that was used throughout the civilization. The script remains undeciphered, despite a large number of attempts and claimed decipherments over the past 80 years. Here, we propose the use of probabilistic models to analyze the structure o...

A Markov model of the Indus script.

Although no historical information exists about the Indus civilization (flourished ca. 2600-1900 B.C.), archaeologists have uncovered about 3,800 short samples of a script that was used throughout the civilization. The script remains undeciphered, despite a large number of attempts and claimed decipherments over the past 80 years. Here, we propose the use of probabilistic models to analyze the ...

Indus Script: A Study of its Sign Design

The Indus script is an undeciphered script of the ancient world. In spite of numerous attempts over several decades, the script has defied universally acceptable decipherment. In a recent series of papers (Yadav et al. 2010; Rao et al. 2009a, b; Yadav et al. 2008a, b) we have analysed the sequences of Indus signs, which demonstrate the presence of a rich syntax and logic in its structure. Here we fo...

Entropic evidence for linguistic structure in the Indus script.

The script of the ancient Indus civilization remains undeciphered. The hypothesis that the script encodes language has recently been questioned. Here, we present evidence for the linguistic hypothesis by showing that the script's conditional entropy is closer to those of natural languages than various types of nonlinguistic systems.

Volume 5

Publication date: 2010
